1. Introduction: Project Overview¶
This project centers around the comprehensive analysis of three distinct datasets: movie Data, critic reviews, and user reviews. The scope of the project encompasses various aspects, including:
- Data Cleaning: Ensuring data integrity and quality for accurate analysis.
- Data Analysis: Identifying and evaluating the best movies based on various metrics.
- Data Visualization: Creating insightful visual representations of the data to facilitate understanding.
- Linear Regression: Applying statistical methods to predict audience scores based on relevant features.
- Natural Language Processing (NLP): Analyzing textual data from reviews to extract meaningful insights.
Below are the summarized details of the three datasets presented in their respective dictionaries:
Data Dictionary for Movies¶
| Column Name | Data Type | Description |
|---|---|---|
movieId |
object | Unique identifier for each movie. |
movieYear |
int64 | The year the movie was released. |
movieURL |
object | URL to the movie's page on Rotten Tomatoes. |
movieTitle |
object | The title of the movie. |
critic_score |
float64 | Average score given by critics for the movie. |
critic_sentiment |
object | Sentiment of the critic's reviews (e.g., positive, negative, neutral). |
audience_score |
float64 | Average score given by audience members for the movie. |
audience_sentiment |
object | Sentiment of audience reviews (e.g., positive, negative, neutral). |
release_date_theaters |
object | The release date of the movie in theaters. |
release_date_streaming |
object | The release date of the movie on streaming platforms. |
rating |
object | The movie's rating (e.g., PG, PG-13, R). |
original_language |
object | The original language of the movie (e.g., English, French). |
runtime |
object | The runtime of the movie in minutes (may be stored as text in some cases, e.g., "120 minutes"). |
Data Dictionary for Critic Reviews¶
| Column Name | Data Type | Description |
|---|---|---|
reviewId |
int64 | Unique identifier for each critic review. |
movieId |
object | Unique identifier for the movie that the review corresponds to. |
creationDate |
object | The date when the review was created or published. |
criticName |
object | The name of the critic who wrote the review. |
criticPageUrl |
object | URL to the critic's page on Rotten Tomatoes or the publication website. |
reviewState |
object | The state of the review (e.g., published, draft). |
isFresh |
bool | A boolean indicator whether the review is categorized as "Fresh" (positive review). |
isRotten |
bool | A boolean indicator whether the review is categorized as "Rotten" (negative review). |
isRtUrl |
object | URL to the Rotten Tomatoes page for this review (if available). |
isTopCritic |
bool | A boolean indicator whether the review was written by a "Top Critic" on Rotten Tomatoes. |
publicationUrl |
object | URL to the publication's homepage or the article containing the review. |
publicationName |
object | The name of the publication where the review was published. |
reviewUrl |
object | Direct URL to the individual review page on the publication's website. |
quote |
object | A brief excerpt or quote from the review. |
scoreSentiment |
object | The sentiment of the review (e.g., positive, negative). |
originalScore |
object | The original score given by the critic, if available. |
Data Dictionary for User Reviews¶
| Column Name | Data Type | Description |
|---|---|---|
movieId |
object | Unique identifier for the movie that the user review corresponds to. |
rating |
float64 | The rating given by the user, typically a score between 0 and 10 (or similar scale). |
quote |
object | A brief excerpt or comment from the user's review. |
reviewId |
object | Unique identifier for each user review. |
isVerified |
bool | A boolean indicating whether the user who wrote the review is a verified user. |
isSuperReviewer |
bool | A boolean indicating whether the user is marked as a "Super Reviewer" (frequent and/or trusted user). |
hasSpoilers |
bool | A boolean indicating whether the review contains spoilers. |
hasProfanity |
bool | A boolean indicating whether the review contains profanity. |
score |
float64 | The score given by the user (if different from rating). |
creationDate |
object | The date when the user review was created or published. |
userDisplayName |
object | The display name of the user who wrote the review. |
userRealm |
object | The region or realm associated with the user (could be location-based information). |
userId |
object | Unique identifier for the user who wrote the review. |
2. Importing Libraries and Loading Data¶
To start our analysis, we import the necessary libraries for data manipulation, visualization, and natural language processing (NLP). Libraries such as Pandas, Matplotlib, Seaborn, NLTK, and Scikit-learn provide powerful tools for handling our datasets.
We then load three CSV files containing movie details, critic reviews, and user reviews into separate DataFrames. This is the foundation for our analysis and will allow us to combine different perspectives on movies.
import os
import gdown
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from textblob import TextBlob
from tqdm import tqdm
import nltk
import time
import ast
from nltk import pos_tag
from collections import Counter
import gc
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
from wordcloud import WordCloud
import spacy
import plotly.graph_objects as go
import re
# # Load in smaller chunks instead of all at once
# def download_and_load_csv_in_chunks(file_id, filename, chunk_size=10000, low_memory=True, dtype=None):
# url = f'https://drive.google.com/uc?export=download&id={file_id}'
# gdown.download(url, filename, quiet=True)
# # Use chunksize to load in small portions
# df_chunks = pd.read_csv(filename, low_memory=low_memory, dtype=dtype, chunksize=chunk_size)
# # Combine chunks to process them incrementally
# return pd.concat(df_chunks, ignore_index=True)
# # Google Drive file IDs
# file_ids = {
# 'movies': '19lYJrl5eTUCyBa0rieheC4i5sIIOlotz',
# 'critic_reviews': '10XvJsU0vT8KK9-krIutzTMFN1NCGrXVD',
# 'user_reviews': '1nFDc_qCWm0AxJfGy653LMycMvUt7K-B3'
# }
# # Now load using the chunked approach
# movies = download_and_load_csv_in_chunks(file_ids['movies'], 'movies.csv', chunk_size=10000, low_memory=False)
# critic_reviews = download_and_load_csv_in_chunks(file_ids['critic_reviews'], 'critic_reviews.csv',
# chunk_size=10000, low_memory=False, dtype={'isRtUrl': 'boolean'})
# user_reviews = download_and_load_csv_in_chunks(file_ids['user_reviews'], 'sampled_user_reviews.csv',
# chunk_size=10000, low_memory=False)
# print("Datasets loaded successfully in chunks.")
3. Data cleaning¶
3.1. Data Cleaning Functions¶
"Camel Case to Snake Case Conversion" Function
To ensure consistency, we convert column names from camelCase to snake_case using a function called camel_to_snake_manual(). This helps standardize column names for easier analysis and reference.
"Standardizing Scores" Function
The standardize_score() function is used to convert different score formats into a uniform scale (from 0 to 10). This allows us to compare scores across critics and users fairly.
"Converting Runtime" Function
The convert_runtime_to_minutes() function converts runtime values from hours and minutes format into total minutes. This ensures that runtime information is in a consistent and comparable numerical format.
# def camel_to_snake_manual(name):
# result = [name[0].lower()]
# for char in name[1:]:
# if char.isupper():
# result.append('_')
# result.append(char.lower())
# else:
# result.append(char)
# return ''.join(result)
# def standardize_score(score):
# try:
# if '/' in score:
# num, denom = map(float, score.split('/'))
# return round((num / denom) * 10, 1)
# else:
# return float(score)
# except:
# return np.nan
# def convert_runtime_to_minutes(runtime):
# if isinstance(runtime, str):
# hours, minutes = 0, 0
# if 'h' in runtime:
# parts = runtime.split('h')
# hours = int(parts[0].strip()) * 60
# if 'm' in parts[1]:
# minutes = int(parts[1].strip('m').strip())
# elif 'm' in runtime:
# minutes = int(runtime.strip('m').strip())
# return hours + minutes
# return None
# print("Cleaning functions defined.")
3.2. Cleaning Movie Data¶
The clean_movies() function cleans the movie dataset by:
- Dropping unnecessary columns (e.g., URLs).
- Converting column names to snake_case.
- Handling missing values for scores, sentiments, and runtime.
- Extracting the release year from different date formats.
- Standardizing the language categories for consistency.
The clean_movies() function encompasses among other thing the three functions previously built: "camel_to_snake_manual", "standardize_score" and "convert_runtime_to_minutes".
The goal is to create a clean and usable dataset ready for merging and further analysis.
# def clean_movies(df):
# df_cleaned = df.copy()
# df_cleaned = df_cleaned.drop(columns='movieURL')
# df_cleaned.columns = [camel_to_snake_manual(col) for col in df_cleaned.columns]
# # Fill missing scores and sentiments
# df_cleaned['critic_score'].fillna(df_cleaned['critic_score'].mean(), inplace=True)
# df_cleaned['audience_score'].fillna(df_cleaned['audience_score'].mean(), inplace=True)
# df_cleaned['critic_sentiment'].fillna('unknown', inplace=True)
# df_cleaned['audience_sentiment'].fillna('unknown', inplace=True)
# df_cleaned['rating'].fillna('unknown', inplace=True)
# # Extract release years
# def extract_year(date_string):
# try:
# return pd.to_datetime(date_string, format='%Y-%m-%d', errors='coerce').year
# except:
# try:
# return pd.to_datetime(date_string, format='%B %d, %Y', errors='coerce').year
# except:
# return pd.NaT
# df_cleaned['release_year_theaters'] = df_cleaned['release_date_theaters'].apply(extract_year)
# df_cleaned['release_year_streaming'] = df_cleaned['release_date_streaming'].apply(extract_year)
# df_cleaned = df_cleaned.drop(columns=['release_date_theaters', 'release_date_streaming'])
# # Round scores
# df_cleaned['critic_score'] = df_cleaned['critic_score'].round(1)
# df_cleaned['audience_score'] = df_cleaned['audience_score'].round(1)
# # Standardize language
# language_mapping = {
# 'English (United Kingdom)': 'English',
# 'English (Australia)': 'English',
# 'British English': 'English',
# 'Australian English': 'English',
# 'Portuguese (Brazil)': 'Portuguese',
# 'Brazilian Portuguese': 'Portuguese',
# 'French (France)': 'French',
# 'French (Canada)': 'French',
# 'Canadian French': 'French',
# 'Unknown language': 'Unknown'
# }
# df_cleaned['original_language'] = df_cleaned['original_language'].replace(language_mapping)
# df_cleaned['original_language'].fillna('Unknown', inplace=True)
# # Convert runtime to minutes
# df_cleaned['runtime_in_minutes'] = df_cleaned['runtime'].apply(convert_runtime_to_minutes)
# df_cleaned = df_cleaned.drop(columns='runtime')
# return df_cleaned
# movies_cleaned = clean_movies(movies)
# print("Movies data cleaned.")
3.3. Cleaning Critic Reviews¶
The clean_critic_reviews() function:
- Drops irrelevant columns.
- Converts the creation date to a year format.
- Handles missing values and filters out placeholder quotes that are not meaningful.
- Converts critic scores to a standardized range of 1 to 10.
This helps in getting more meaningful data about critics' opinions and ensuring that only relevant quotes and reviews are retained.
# def clean_critic_reviews(df):
# df_cleaned = df.copy()
# df_cleaned.columns = [camel_to_snake_manual(col) for col in df_cleaned.columns]
# columns_to_remove = ['critic_page_url', 'review_state', 'is_rotten', 'is_rt_url', 'publication_url', 'review_url']
# df_cleaned = df_cleaned.drop(columns=columns_to_remove)
# df_cleaned['creation_date'] = pd.to_datetime(df_cleaned['creation_date'], errors='coerce')
# df_cleaned['creation_year'] = df_cleaned['creation_date'].dt.year
# df_cleaned = df_cleaned.drop(columns=['creation_date'])
# df_cleaned['critic_name'] = df_cleaned['critic_name'].fillna('Unknown Critic')
# df_cleaned = df_cleaned.dropna(subset=['quote'])
# df_cleaned = df_cleaned.drop_duplicates(subset=['review_id'], keep='first')
# placeholders = ["full review at Movies for the Masses", "full review in Greek", ".",
# "click for full review", "Click to read review", "click to read full review",
# "See website for more details.", "click to read the full review", "(No quote available.)"]
# df_cleaned = df_cleaned[~df_cleaned['quote'].isin(placeholders)]
# df_cleaned = df_cleaned[df_cleaned['quote'].str.len() > 5]
# df_cleaned['standardized_score'] = df_cleaned['original_score'].apply(standardize_score)
# mean_score = df_cleaned['standardized_score'].mean()
# df_cleaned['standardized_score'] = df_cleaned['standardized_score'].fillna(round(mean_score))
# df_cleaned['standardized_score'] = df_cleaned['standardized_score'].apply(lambda x: min(max(round(x), 1), 10))
# df_cleaned = df_cleaned.drop(columns=['original_score'])
# return df_cleaned
# critic_reviews_cleaned = clean_critic_reviews(critic_reviews)
# print("Critic reviews cleaned.")
3.4. Cleaning User Reviews¶
The clean_user_reviews() function cleans the user reviews by removing irrelevant columns, standardizing scores, and converting dates to a consistent year format. This is necessary for combining user feedback effectively with critic reviews and movie details.
# def clean_user_reviews(df):
# df_cleaned = df.copy()
# df_cleaned = df_cleaned.drop(['reviewId', 'userDisplayName', 'isVerified', 'hasSpoilers', 'userRealm', 'hasProfanity', 'isSuperReviewer', 'rating'], axis=1)
# df_cleaned.columns = [camel_to_snake_manual(col) for col in df_cleaned.columns]
# def standardize_score(score):
# return min(10, max(1, round(score * 2)))
# df_cleaned['standardized_score'] = df_cleaned['score'].apply(standardize_score)
# df_cleaned['creation_date'] = pd.to_datetime(df_cleaned['creation_date'], errors='coerce')
# df_cleaned['creation_year'] = df_cleaned['creation_date'].dt.year
# df_cleaned = df_cleaned.drop(columns=['creation_date'])
# return df_cleaned
# sampled_user_reviews_cleaned = clean_user_reviews(user_reviews)
# print("User reviews cleaned.")
3.5. Merging Datasets¶
We merged the three cleaned datasets (movies, critic reviews, and user reviews) using the merge_datasets() function. This function:
- Aggregates critic and user reviews to calculate average scores.
- Combines quotes from both critics and users.
- Keeps existing sentiments from the original movie dataset.
This comprehensive dataset provides a complete picture, combining both critic and user perspectives on each movie.
file_id = "1yBwdFTb_PRGobhkgLMB7Opl051f7DbwV"
url = f"https://drive.google.com/uc?id={file_id}"
output = "movies_cleaned.csv"
gdown.download(url, output, quiet=False)
Downloading... From: https://drive.google.com/uc?id=1yBwdFTb_PRGobhkgLMB7Opl051f7DbwV To: /Users/marlissa/Library/CloudStorage/Nextcloud-rinoa@nextcloud.philipconroy.com/Mac/Downloads/movies_cleaned.csv 100%|██████████| 1.15M/1.15M [00:00<00:00, 28.1MB/s]
'movies_cleaned.csv'
movies_cleaned = pd.read_csv("movies_cleaned.csv")
movies_cleaned[["movie_title","rating"]]
| movie_title | rating | |
|---|---|---|
| 0 | A Trip to the Moon | Unknown |
| 1 | The Birth of a Nation | Unknown |
| 2 | The Cabinet of Dr. Caligari | Unknown |
| 3 | The Kid | Unknown |
| 4 | Nosferatu | Unknown |
| ... | ... | ... |
| 10228 | Firebrand | R |
| 10229 | Woody Woodpecker Goes to Camp | Unknown |
| 10230 | You'll Never Find Me | Unknown |
| 10231 | American Dreamer | Unknown |
| 10232 | The Promised Land | R |
10233 rows × 2 columns
# Count the occurrences of "Unknown" and "unknown" in the 'rating' column of movies_clean
rating_counts = movies_cleaned['rating'].value_counts()
# Extract counts for "Unknown" and "unknown"
count_unknown = rating_counts.get("Unknown", 0) # Returns 0 if "Unknown" is not found
count_unknown_lower = rating_counts.get("unknown", 0) # Returns 0 if "unknown" is not found
# Display the results
print(f"Count of 'Unknown': {count_unknown}")
print(f"Count of 'unknown': {count_unknown_lower}")
Count of 'Unknown': 2475 Count of 'unknown': 0
4. Data Analysis of the movies data
¶
This section shows the data analysis and the corresponding data visualization for each of the following parts:
- "Most popular movies of all time"
- "Most popular 100 movies by critic score"
- "Most popular 100 movies by combined critic & audience score".
For this part of the project the clean dataframe called "movies_cleaned" will be used.
4.1. Most popular 100 movies of all time¶
To identify the 100 most popular movies of all time, we used audience_score as respresentative of popularity, ranked them and listed the top 100.
# most popular movie based on the audience score
pd.set_option('display.max_rows', 100)
top_100_movies = movies_cleaned.sort_values(by=['audience_score', 'movie_year'], ascending=[False, True]).head(100)
top_100_movies[['movie_title', 'audience_score', 'movie_year']]
| movie_title | audience_score | movie_year | |
|---|---|---|---|
| 1909 | The City | 100.0 | 1998 |
| 6427 | The Light of the Moon | 100.0 | 2017 |
| 6442 | The Departure | 100.0 | 2017 |
| 6678 | Momentum Generation | 100.0 | 2018 |
| 6698 | Longing | 100.0 | 2018 |
| 6992 | Legend of the Mountain | 100.0 | 2018 |
| 7265 | Just Say Goodbye | 100.0 | 2019 |
| 7280 | Haunt | 100.0 | 2019 |
| 7345 | Hesburgh | 100.0 | 2019 |
| 7396 | Between Me and My Mind | 100.0 | 2019 |
| 7405 | The Bad Guys: Reign of Chaos | 100.0 | 2019 |
| 7437 | The Woman Who Loves Giraffes | 100.0 | 2019 |
| 7439 | Sea of Shadows | 100.0 | 2019 |
| 7467 | Little Monsters | 100.0 | 2019 |
| 7482 | Raise Hell: The Life & Times of Molly Ivins | 100.0 | 2019 |
| 7633 | The Banker | 100.0 | 2020 |
| 7635 | Heaven's Rain | 100.0 | 2020 |
| 7716 | Anthony | 100.0 | 2020 |
| 7831 | Incision | 100.0 | 2020 |
| 7885 | Gold Dust | 100.0 | 2020 |
| 8007 | The Man Standing Next | 100.0 | 2020 |
| 8016 | Pray: The Story of Patrick Peyton | 100.0 | 2020 |
| 8195 | Playing With Sharks | 100.0 | 2021 |
| 8361 | God's Not Dead: We the People | 100.0 | 2021 |
| 8498 | Raging Fire | 100.0 | 2021 |
| 8565 | Hudson | 100.0 | 2021 |
| 8658 | Good Night Oppy | 100.0 | 2022 |
| 8768 | The Justice of Bunny King | 100.0 | 2022 |
| 8777 | The Roundup | 100.0 | 2022 |
| 8804 | Extramundane | 100.0 | 2022 |
| 8880 | Solid Rock Trust | 100.0 | 2022 |
| 8994 | A Pebble in the Pond | 100.0 | 2022 |
| 9202 | About Him & Her | 100.0 | 2023 |
| 9213 | Wild Boys | 100.0 | 2023 |
| 9224 | Getting Away With Murder(s) | 100.0 | 2023 |
| 9476 | They Cloned Tyrone | 100.0 | 2023 |
| 9658 | A Revolution on Canvas | 100.0 | 2023 |
| 9679 | Exemplum | 100.0 | 2023 |
| 9695 | To Me, the One Who Loved You | 100.0 | 2023 |
| 9698 | Buck Alamo or (A Phantasmagorical Ballad) | 100.0 | 2023 |
| 9807 | Chicken for Linda! | 100.0 | 2024 |
| 9823 | Keith Robinson: Different Strokes | 100.0 | 2024 |
| 9861 | Living With Leopards | 100.0 | 2024 |
| 9912 | Diane von Furstenberg: Woman in Charge | 100.0 | 2024 |
| 9973 | Holy Irresistible | 100.0 | 2024 |
| 9980 | For Love & Life: No Ordinary Campaign | 100.0 | 2024 |
| 9985 | Edge of Everything | 100.0 | 2024 |
| 10077 | The Mattachine Family | 100.0 | 2024 |
| 10114 | Stevie Van Zandt: Disciple | 100.0 | 2024 |
| 10120 | Ennio | 100.0 | 2024 |
| 10124 | I Am: Celine Dion | 100.0 | 2024 |
| 10155 | REVIVAL69: The Concert that Rocked the World | 100.0 | 2024 |
| 10174 | Neal Brennan: Crazy Good | 100.0 | 2024 |
| 10179 | The Sixth | 100.0 | 2024 |
| 10225 | The Grab | 100.0 | 2024 |
| 6945 | Maktub | 99.0 | 2018 |
| 7065 | Sex Weather | 99.0 | 2018 |
| 7436 | Everybody's Everything | 99.0 | 2019 |
| 7545 | The Game Changers | 99.0 | 2019 |
| 7623 | Dave Chappelle: Sticks & Stones | 99.0 | 2019 |
| 7835 | Kiss the Ground | 99.0 | 2020 |
| 8085 | Just Mercy | 99.0 | 2020 |
| 8258 | Violet Evergarden: The Movie | 99.0 | 2021 |
| 8418 | The Rescue | 99.0 | 2021 |
| 8579 | Demon Slayer -Kimetsu no Yaiba- The Movie: Mug... | 99.0 | 2021 |
| 8632 | The Woman King | 99.0 | 2022 |
| 8834 | Bad Axe | 99.0 | 2022 |
| 9069 | Top Gun: Maverick | 99.0 | 2022 |
| 9280 | 26.2 to Life | 99.0 | 2023 |
| 9320 | Into the Spotlight | 99.0 | 2023 |
| 9344 | Freedom's Path | 99.0 | 2023 |
| 9507 | Jesus Revolution | 99.0 | 2023 |
| 9527 | The Forger | 99.0 | 2023 |
| 9548 | Sound of Freedom | 99.0 | 2023 |
| 9816 | Remembering Gene Wilder | 99.0 | 2024 |
| 10023 | The American | 99.0 | 2024 |
| 10126 | Unsung Hero | 99.0 | 2024 |
| 10175 | Spermworld | 99.0 | 2024 |
| 10183 | Ordinary Angels | 99.0 | 2024 |
| 515 | The Godfather | 98.0 | 1972 |
| 1565 | The Shawshank Redemption | 98.0 | 1994 |
| 6726 | Back Roads | 98.0 | 2018 |
| 7211 | Ford v Ferrari | 98.0 | 2019 |
| 7504 | Overcomer | 98.0 | 2019 |
| 7692 | My Hero Academia: Heroes Rising | 98.0 | 2020 |
| 7965 | I Still Believe | 98.0 | 2020 |
| 8000 | Wolfwalkers | 98.0 | 2020 |
| 8126 | American Underdog | 98.0 | 2021 |
| 8135 | Summer of Soul (...Or, When the Revolution Cou... | 98.0 | 2021 |
| 8194 | King Richard | 98.0 | 2021 |
| 8210 | Spider-Man: No Way Home | 98.0 | 2021 |
| 8253 | Shang-Chi and the Legend of the Ten Rings | 98.0 | 2021 |
| 8555 | The Sparks Brothers | 98.0 | 2021 |
| 8771 | Jujutsu Kaisen 0: The Movie | 98.0 | 2022 |
| 8849 | Carmen | 98.0 | 2022 |
| 8852 | I Heard the Bells | 98.0 | 2022 |
| 9335 | Scarlet | 98.0 | 2023 |
| 9347 | Air | 98.0 | 2023 |
| 9427 | R.M.N. | 98.0 | 2023 |
| 9453 | Guy Ritchie's The Covenant | 98.0 | 2023 |
4.2. Most popular 100 movies by critic score¶
# Sort the movies based on critic score and movie year, and get the top 100
top_100_movies = movies_cleaned.sort_values(by=['critic_score', 'movie_year'], ascending=[False, True]).head(100)
# Define 10-year bins and create a new column to group movie years by decade
bins = range(1900, 2030, 10) # 10-year bins
labels = [f'{b}-{b+9}' for b in bins[:-1]] # Create labels for the bins
# Create a new column 'decade' in top_100_movies to categorize movies into 10-year intervals
top_100_movies['decade'] = pd.cut(top_100_movies['movie_year'], bins=bins, labels=labels, right=False)
# Count the number of movies in each decade
decade_counts = top_100_movies['decade'].value_counts().sort_index()
# Set global parameters for text anti-aliasing and sharpness
plt.rcParams['text.antialiased'] = True
plt.rcParams['figure.dpi'] = 150 # Increase DPI for better text sharpness
# Set up the figure with high DPI for sharper text
plt.figure(figsize=(14, 8)) # Increased figure size for better display
# Plot a bar plot with Seaborn for separated bars
ax = sns.barplot(x=decade_counts.index, y=decade_counts.values, edgecolor='black', color='#1f77b4') # Original blue color
# Set titles and labels with anti-aliasing enabled by default, add pad to title for spacing
plt.title('Distribution of the Most Popular 100 Movies by Critic Score (Grouped by Decade)', fontsize=16, pad=20) # Adjust the pad value for distance
plt.xlabel('Decade', fontsize=12)
plt.ylabel('Number of Movies', fontsize=12)
# Remove the top spine (frame line above the plot)
ax.spines['top'].set_visible(False)
ax.spines["right"].set_visible(False)
# Add numbers above each bar
for index, value in enumerate(decade_counts.values):
plt.text(index, value + 0.5, str(value), ha='center', fontsize=10)
# Adjust the layout for better presentation
plt.tight_layout()
# Show the plot
plt.show()
/opt/anaconda3/lib/python3.11/site-packages/seaborn/categorical.py:641: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. grouped_vals = vals.groupby(grouper)
# Get the top 100 most popular movies based on critic_score
top_100_movies_critic_score = movies_cleaned.nlargest(100, 'critic_score')
# Create year bins/intervals for every 10 years (including "1910-1919")
bins = [1900, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020]
labels = ['1900-1909', '1910-1919', '1920-1929', '1930-1939', '1940-1949', '1950-1959',
'1960-1969', '1970-1979', '1980-1989', '1990-1999', '2000-2009', '2010-2019']
# Bin the movie years into intervals, even if a bin is empty
top_100_movies_critic_score['year_interval'] = pd.cut(top_100_movies_critic_score['movie_year'], bins=bins, labels=labels, right=False)
# Create a new DataFrame to consolidate movie titles by year interval, sorted by year
hover_data = top_100_movies_critic_score.groupby('year_interval', observed=False).apply(
lambda x: '<br>• '.join(sorted([f"{title} ({year})" for title, year in zip(x['movie_title'], x['movie_year'])], key=lambda y: int(y.split('(')[-1][:-1])))).reset_index(name='movie_title_list')
# Ensure the first movie title in each hover starts with a bullet point
hover_data['movie_title_list'] = '• ' + hover_data['movie_title_list']
# Merge back to include movie titles in the main DataFrame
top_100_movies_critic_score = top_100_movies_critic_score.merge(hover_data, on='year_interval', suffixes=('', '_list'), how='right')
# Create the bar plot
fig = px.bar(
top_100_movies_critic_score,
x='year_interval',
y='critic_score', # This is just a dummy to create the bars, it doesn't matter
color_discrete_sequence=['green'],
hover_name='year_interval',
hover_data={'movie_title_list': True}, # Show the list of movie titles on hover
title='Top 100 Movies by Critic Score (Interactive Graph)',
labels={'year_interval': 'Year Interval', 'critic_score': 'Movies Count'}
)
# Customize hover to show movie titles and other desired info
fig.update_traces(hovertemplate='Movies with best critic score (from past to present):<br><br>%{customdata[0]}')
# Remove y-axis labels since they aren't meaningful here
fig.update_layout(yaxis_title='', yaxis=dict(showticklabels=False))
# Center the title and add the message below it with a small distance from the graph
fig.update_layout(
title_x=0.5, # Center the title
annotations=[
dict(
text="All these movies have been rated 100 by the critic. The graph shows these movies grouped by decade.<br>Go with the mouse on a bar to see the list of movies.",
xref="paper", yref="paper",
x=0.5, y=1.1, # Slightly increased 'y' to add space between text and graph
showarrow=False,
font=dict(size=12), # Increased font size
align="center"
)
],
xaxis=dict(tickfont=dict(size=10)) # Adjust the font size of x-axis labels (decades)
)
# Show the plot
fig.show()
4.3. Top 100 movies by "combined critic & audience score"¶
# Calculate the combined score as the average of critic_score and audience_score
movies_cleaned['combined_score'] = (movies_cleaned['critic_score'] + movies_cleaned['audience_score']) / 2
# Get the top 100 movies based on the combined score
top_100_combined_score = movies_cleaned.nlargest(100, 'combined_score')
# Create a new DataFrame to consolidate movie titles by combined score
hover_data = top_100_combined_score.groupby('combined_score').apply(
lambda x: '<br>• '.join(sorted([f"{title} ({year})" for title, year in zip(x['movie_title'], x['movie_year'])], key=lambda y: int(y.split('(')[-1][:-1])))
).reset_index(name='movie_title_list')
# Merge back to include movie titles in the main DataFrame
top_100_combined_score = top_100_combined_score.merge(hover_data, on='combined_score')
# Create the scatter plot
fig = px.scatter(
top_100_combined_score,
x='combined_score',
y='combined_score', # Dummy y-value for the dots
hover_name='combined_score',
hover_data={'movie_title_list': True}, # Show the list of movie titles
title='Top 100 Movies by Combined Critic and Audience Score (interactive graph)',
size='combined_score', # Size based on the combined score
size_max=20, # Adjust the size of the dots
)
# Update hover template to show bullet points with a line break
fig.update_traces(hovertemplate='Movie titles (from past to present):<br><br>• %{customdata[0]}')
# Center the title
fig.update_layout(title_x=0.5)
# Remove Y-axis as it is not meaningful
fig.update_layout(yaxis_title='', yaxis=dict(showticklabels=False))
# Show the plot
fig.show()
5. Data Science Methods¶
In this chapter, we delve into the application of two widely recognized and powerful data science methodologies: Multiple Linear Regression and Natural Language Processing (NLP). These techniques have been instrumental in extracting deeper insights from the datasets of the project.
5.1. Multiple Linear Regression¶
Multiple Linear Regression is a statistical approach used to model the relationship between multiple independent variables and a dependent variable. This method enables the identification and quantification of the influence of various factors on a particular outcome, thereby providing valuable insights into the underlying patterns within the data. By employing Multiple Linear Regression, significant relationships can be uncovered that enhance the understanding of trends and predictions in the movie industry.
For the development of the linear regression model, only variables from the "movies" dataset were utilized. To facilitate this process, a copy of the "movies_cleaned" DataFrame was created, ensuring that the original data remained intact while allowing for the necessary analyses to be conducted. Variables from the "critic reviews" and "user reviews" datasets were excluded from this analysis.
# create a copy of the movies_cleaned dataset
movies_cleaned2 = movies_cleaned.copy()
For our model we will use the following variables:
"audience_score" as dependent variable;
"rating","runtime_in_minutes" and "original_language" as independent variables. Before building the linear regression model the following variables need some adjustments:
rating
runtime_in_minutes
original_language
For example, the variable "rating" has 2475 "Unknown" values. For simplicity we drop the rows with this rating value:
# Let's find out how many rows of the column Rating have Unknown in them
unknown_count = movies_cleaned2[movies_cleaned2['rating'] == 'unknown'].shape[0]
print(f'Number of rows with "unknown" rating: {unknown_count}')
Number of rows with "unknown" rating: 0
# Remove rows where the 'rating' column contains the string 'Unknown' in movies_cleaned2 dataset
movies_cleaned2 = movies_cleaned2[movies_cleaned2['rating'] != 'unknown']
print(f'Number of rows with "unknown" rating: {unknown_count}')
Number of rows with "unknown" rating: 0
Now let's focus on the "runtime_in_minutes" column, which has some irrelevant values too:
# Find out how many rows of the column runtime_in_minutes is missing or non-numerical
missing_count = movies_cleaned2['runtime_in_minutes'].isna().sum()
# Count rows where 'runtime_in_minutes' is non-numerical
non_numeric_count = pd.to_numeric(movies_cleaned2['runtime_in_minutes'], errors='coerce').isna().sum()
# Exclude the NaN values from the non-numerical count (if they're already counted as missing)
non_numeric_only = non_numeric_count - missing_count
print(f'Number of missing (NaN) values in "runtime_in_minutes": {missing_count}')
print(f'Number of non-numerical values in "runtime_in_minutes": {non_numeric_only}')
Number of missing (NaN) values in "runtime_in_minutes": 17 Number of non-numerical values in "runtime_in_minutes": 0
Let's print the movie_titles with missing runtime_in_minutes:
# Filter rows where 'runtime_in_minutes' is missing (NaN)
missing_runtime_movies_cleaned2 = movies_cleaned2[movies_cleaned2['runtime_in_minutes'].isna()]
# Select and display relevant columns: movie_title, movie_year, runtime_in_minutes
missing_runtime_info = missing_runtime_movies_cleaned2[['movie_title', 'movie_year', 'runtime_in_minutes']]
# Show the rows with missing runtime information
print(missing_runtime_info)
movie_title movie_year \
7244 Article 15 2019
7304 Escape Plan: The Extractors 2019
7714 The Godfather, Coda: The Death of Michael Corl... 2020
7800 American Murder: The Family Next Door 2020
7895 Monstrum 2020
8031 Batman: Death in the Family 2020
8185 Nightbooks 2021
8213 No One Gets Out Alive 2021
8684 The Redeem Team 2022
8697 Good Mourning 2022
9014 This Place Rules 2022
9200 Quasi 2023
9599 Trolls Band Together 2023
9766 Monster Inside: America's Most Extreme Haunted... 2023
9921 Alex Edelman: Just For Us 2024
9973 Holy Irresistible 2024
10011 Adam the First 2024
runtime_in_minutes
7244 NaN
7304 NaN
7714 NaN
7800 NaN
7895 NaN
8031 NaN
8185 NaN
8213 NaN
8684 NaN
8697 NaN
9014 NaN
9200 NaN
9599 NaN
9766 NaN
9921 NaN
9973 NaN
10011 NaN
Next is to fill this missing runtime_in_minutes with accurate values:
# We will create a dictionary with movie titles, years, and their correct runtimes
runtime_updates = {
('Quasi', 2023): 100
}
# Update the runtime_in_minutes column
for (title, year), runtime in runtime_updates.items():
movies_cleaned2.loc[(movies_cleaned2['movie_title'] == title) & (movies_cleaned2['movie_year'] == year), 'runtime_in_minutes'] = runtime
movies_cleaned2_relevant = movies_cleaned2[['movie_title', 'movie_year', 'runtime_in_minutes']]
# Verify the changes
print(movies_cleaned2_relevant[movies_cleaned2_relevant['runtime_in_minutes'].isna() == False].loc[movies_cleaned2['movie_title'].isin([title for title, year in runtime_updates])])
movie_title movie_year runtime_in_minutes 9200 Quasi 2023 100.0
Now we will fix the data in the column "original_language" by inserting the missing language for the rows with missing or "Unknown" value:
# Check for missing and "Unknown" values in the 'original_language' column
missing_values = movies_cleaned2['original_language'].isnull().sum()
# To find out how many rows of the column original_language have Unknown in them
unknown_count = movies_cleaned2[movies_cleaned2['original_language'] == 'Unknown'].shape[0]
print(f"Number of missing values in 'original_language': {missing_values}")
print(f"Number of rows with 'Unknown' in 'original_language': {unknown_count}")
Number of missing values in 'original_language': 0 Number of rows with 'Unknown' in 'original_language': 23
# Obtain list of movies where 'original_language' is 'Unknown'
unknown_languages = movies_cleaned2[movies_cleaned2['original_language'] == 'Unknown']
# Extract the list of movie titles where original_language is 'Unknown'
movie_titles_with_unknown_language = unknown_languages[['movie_title','original_language']].values.tolist()
# Output the list
movie_titles_with_unknown_language
[['The Birth of a Nation', 'Unknown'], ['The Passion of Joan of Arc', 'Unknown'], ["Pandora's Box", 'Unknown'], ['City Lights', 'Unknown'], ['Baraka', 'Unknown'], ['Apocalypto', 'Unknown'], ['The Artist', 'Unknown'], ['Bo Burnham: Make Happy', 'Unknown'], ['Munyurangabo', 'Unknown'], ['Tell It to the Marines', 'Unknown'], ['The Red Turtle', 'Unknown'], ['Adam Sandler: 100% Fresh', 'Unknown'], ['Hannah Gadsby: Nanette', 'Unknown'], ['Maharshi', 'Unknown'], ['Escape Plan: The Extractors', 'Unknown'], ["Midsommar: The Director's Cut", 'Unknown'], ['Dave Chappelle: Sticks & Stones', 'Unknown'], ['Monstrum', 'Unknown'], ['The Witch: Part 1 - The Subversion', 'Unknown'], ['Tales from the Apocalypse', 'Unknown'], ['Silver and the Book of Dreams', 'Unknown'], ['Mavka: The Forest Song', 'Unknown'], ['Out of Darkness', 'Unknown']]
# Dictionary mapping movie titles to their original languages
language_mapping = {
"Adam Sandler: 100% Fresh": "English",
"Apocalypto": "Yucatec Maya",
"Bo Burnham: Make Happy": "English",
"City Lights": "Nonverbal",
"Hannah Gadsby: Nanette": "English",
"Out of Darkness": "Tola", # Fictitious language
"The Artist": "French",
"The Red Turtle": "Nonverbal"
}
# Update the original_language column based on the mapping
movies_cleaned2['original_language'] = movies_cleaned2['movie_title'].map(language_mapping).fillna(movies_cleaned2['original_language'])
# Check for missing values in the 'original_language' column
missing_values = movies_cleaned2['original_language'].isnull().sum()
# To find out how many rows of the column original_language have Unknown in them
unknown_count = movies_cleaned2[movies_cleaned2['original_language'] == 'Unknown'].shape[0]
print(f"Number of missing values in 'original_language': {missing_values}")
print(f"Number of rows with 'Unknown' in 'original_language': {unknown_count}")
print("'original_language' column is filled.")
Number of missing values in 'original_language': 0 Number of rows with 'Unknown' in 'original_language': 15 'original_language' column is filled.
Our next step is to convert any columns of our interest that contain strings into categories. Once we can do that, we can apply one-hot encoding to convert categorical data into a numeric format that can be used for our regression model
# Which columns contain strings?
# These columns contain strings
for label, content in movies_cleaned2.items():
if pd.api.types.is_string_dtype(content):
print(label)
movie_id movie_title critic_sentiment audience_sentiment rating original_language
# List of columns that need to be converted to category dtype for our regression model
columns_to_convert = ['rating', 'original_language']
# Loop through the dataframe columns
for label, content in movies_cleaned2.items():
if label in columns_to_convert and pd.api.types.is_string_dtype(content):
movies_cleaned2[label] = content.astype("category").cat.as_ordered()
# Verify the change
print(movies_cleaned2.dtypes)
movie_id object movie_year int64 movie_title object critic_score float64 critic_sentiment object audience_score float64 audience_sentiment object rating category original_language category release_year_theaters float64 release_year_streaming int64 runtime_in_minutes float64 combined_score float64 dtype: object
5.1.1. Defining the baseline variables before applying the "one-hot encoding" method¶
# Check the frequency of each category in the 'rating' column
rating_counts = movies_cleaned2['rating'].value_counts()
# Print the most common rating
most_common_rating = rating_counts.idxmax()
print(f"The most common rating is: {most_common_rating}")
The most common rating is: R
So R rating will be our baseline to compare every category in rating with
# Check the frequency of each category in the 'original_language' column
original_language_counts = movies_cleaned2['original_language'].value_counts()
# Print the most common rating
most_common_original_language = original_language_counts.idxmax()
print(f"The most common original_language is: {most_common_original_language}")
The most common original_language is: English
So English will be our baseline to compare every category in original_language with
# Do we need to normalize runtime_in_minutes?
movies_cleaned2['runtime_in_minutes'].describe()
count 10217.000000 mean 106.174905 std 20.221823 min 12.000000 25% 93.000000 50% 103.000000 75% 116.000000 max 317.000000 Name: runtime_in_minutes, dtype: float64
5.1.2. Normalization of "runtime_in_minutes"¶
The runtime_in_minutes distribution gives important insights into the data:
Count (7758): This shows that there are 7758 movies in the dataset with a recorded runtime, which is a decent sample size.
Mean (104): The average runtime is around 104 minutes. This suggests that most movies in the dataset are approximately 1 hour and 46 minutes long.
Standard Deviation (19.14): The standard deviation of 19.14 minutes tells us that the runtimes vary moderately around the mean. Most movies will likely have runtimes between 85 minutes and 123 minutes (mean ± 1 standard deviation).
Minimum (12.0): The shortest movie is only 12 minutes long, which is likely a short film or some form of special content.
Quartiles (25th percentile: 94, 50th percentile (median): 104, 75th percentile: 117):
25th percentile (94 minutes): 25% of the movies are shorter than 94 minutes. Median (104 minutes): 50% of the movies are shorter than 104 minutes. 75th percentile (116 minutes): 75% of the movies are shorter than 117 minutes. The interquartile range (IQR), which measures the middle 50% of runtimes, is 23 minutes (116 - 94). This indicates a fairly narrow range for most movies, implying that many movies fall between 94 and 117 minutes long.
Maximum (254 minutes): The longest movie in the dataset is 254 minutes (over 4 hours long), which is an extreme outlier.
Key Observations: Outliers: The minimum (12 minutes) and maximum (254 minutes) suggest there are outliers in your dataset, especially at the higher end. Distribution Shape: Given the quartiles and the mean, the runtime distribution seems slightly right-skewed (positively skewed), meaning there are some very long movies pushing the average higher than the median. Need for Normalization: Since the range is from 12 to 254 minutes, and the standard deviation is relatively high compared to the mean, normalizing the runtime could be helpful to reduce the influence of extreme values on your linear regression model.
# so we will proceed to normalizing the runtime_in_minutes using min-max scaling to reduce the impact of outliers
from sklearn.preprocessing import MinMaxScaler
# Initialize Min-Max Scaler
scaler = MinMaxScaler()
# Select the column to be normalized
movies_cleaned2['runtime_normalized'] = scaler.fit_transform(movies_cleaned2[['runtime_in_minutes']])
# drop the null values in the column "runtime_normalized"
movies_cleaned2 = movies_cleaned2.dropna(subset=['runtime_normalized'])
# Display a sample of the normalized data
print(movies_cleaned2[['runtime_in_minutes', 'runtime_normalized']].head())
runtime_in_minutes runtime_normalized 0 14.0 0.006557 1 190.0 0.583607 2 69.0 0.186885 3 60.0 0.157377 4 65.0 0.173770
5.1.3. Should we normalize the dependent variable "audience_score"?¶
movies_cleaned2['audience_score'].describe()
count 10217.000000 mean 64.346638 std 20.756292 min 0.000000 25% 49.000000 50% 67.000000 75% 82.000000 max 100.000000 Name: audience_score, dtype: float64
movies_cleaned2['audience_score'].hist()
<Axes: >
Looking at the summary statistics for audience_score:
Count: 7,758 movies have non-missing audience_score. Mean: The average score is about 63.49. Std: The standard deviation is 20.30, indicating a moderate spread around the mean. Min and Max: The minimum score is 0, and the maximum is 100, covering the full range of possible audience scores. Quartiles: 25% of movies have an audience score of 48 or below. 50% (median) have a score of 66 or below. 75% of movies have a score of 80 or below. Key Points: The audience score distribution is not extremely skewed, as the mean and median are relatively close (63.49 vs. 66), suggesting that the data is reasonably centered around the middle. The range from 0 to 100 and the standard deviation indicate variance but not extreme outliers. Should We Normalize audience_score? Given that the scores are within a standard range (0–100) and show no extreme skew or outliers, normalization is likely not necessary. The distribution is fairly even, so leaving it in its raw form will keep our model predictions in a more interpretable range.
5.1.4. One-hot encoding for the variabless "rating" and "original_language"¶
movies_cleaned2.columns
Index(['movie_id', 'movie_year', 'movie_title', 'critic_score',
'critic_sentiment', 'audience_score', 'audience_sentiment', 'rating',
'original_language', 'release_year_theaters', 'release_year_streaming',
'runtime_in_minutes', 'combined_score', 'runtime_normalized'],
dtype='object')
# Step 1: Exclude unnecessary columns
exclude_cols = ['movie_id', 'movie_year', 'movie_title', 'critic_score', 'critic_sentiment',
'audience_sentiment', 'release_year_theaters', 'release_year_streaming',
'runtime_in_minutes', 'combined_score']
# Step 2: One-hot encode categorical variables and exclude columns not being used
movies_encoded = pd.get_dummies(movies_cleaned2.drop(columns=exclude_cols + ['audience_score']),
columns=['rating', 'original_language'], drop_first=False)
# Manually drop 'rating_R' and 'original_language_English' to set them as the baseline
movies_encoded = movies_encoded.drop(['rating_R', 'original_language_English'], axis=1)
movies_encoded.head(50).T
# Add 'runtime_normalized' to the dataset (it's already part of the dataframe, so no need to add it again)
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| runtime_normalized | 0.006557 | 0.583607 | 0.186885 | 0.157377 | 0.17377 | 0.203279 | 0.229508 | 0.180328 | 0.363934 | 0.232787 | ... | 0.295082 | 0.278689 | 0.377049 | 0.278689 | 0.370492 | 0.268852 | 0.27541 | 0.291803 | 0.301639 | 0.321311 |
| rating_G | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | True | False | False |
| rating_NC-17 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| rating_PG | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| rating_PG-13 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| rating_TV | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| rating_TV-14 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| rating_Unknown | True | True | True | True | True | True | True | True | True | True | ... | True | True | True | True | True | True | True | False | True | True |
| original_language_Afrikaans | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Albanian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Arabic | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Aramaic | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Bangla | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Basque | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Belarusian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Bosnian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Catalan | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Chinese | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Czech | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Danish | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Dutch | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Estonian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_European Spanish | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Filipino | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Finnish | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_French | True | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | True |
| original_language_Georgian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_German | False | False | True | False | True | False | False | False | True | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Greek | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Hebrew | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Hindi | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Hungarian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Icelandic | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Indonesian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Inuktitut | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Italian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Japanese | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Kalaallisut | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Kannada | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Khmer | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Korean | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Lao | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Macedonian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Malay | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Malayalam | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Mongolian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Nonverbal | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Norwegian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Persian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Polish | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Portuguese | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Romanian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Russian | False | False | False | False | False | False | False | True | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Serbian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Slovak | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Spanish | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Spanish (Spain) | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Swahili | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Swedish | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Tagalog | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Tamil | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Telugu | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Thai | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Tola | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Turkish | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Ukrainian | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Unknown | False | True | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Vietnamese | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Welsh | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Wolof | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Yiddish | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Yucatec Maya | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| original_language_Zulu | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
73 rows × 50 columns
# Look at a specific row before and after encoding
print(movies_cleaned2[['rating', 'original_language']].iloc[10]) # Original categorical values for row 10
print(movies_encoded.iloc[10]) # One-hot encoded values for row 10
rating Unknown original_language Unknown Name: 10, dtype: object runtime_normalized 0.213115 rating_G False rating_NC-17 False rating_PG False rating_PG-13 False rating_TV False rating_TV-14 False rating_Unknown True original_language_Afrikaans False original_language_Albanian False original_language_Arabic False original_language_Aramaic False original_language_Bangla False original_language_Basque False original_language_Belarusian False original_language_Bosnian False original_language_Catalan False original_language_Chinese False original_language_Czech False original_language_Danish False original_language_Dutch False original_language_Estonian False original_language_European Spanish False original_language_Filipino False original_language_Finnish False original_language_French False original_language_Georgian False original_language_German False original_language_Greek False original_language_Hebrew False original_language_Hindi False original_language_Hungarian False original_language_Icelandic False original_language_Indonesian False original_language_Inuktitut False original_language_Italian False original_language_Japanese False original_language_Kalaallisut False original_language_Kannada False original_language_Khmer False original_language_Korean False original_language_Lao False original_language_Macedonian False original_language_Malay False original_language_Malayalam False original_language_Mongolian False original_language_Nonverbal False original_language_Norwegian False original_language_Persian False original_language_Polish False original_language_Portuguese False original_language_Romanian False original_language_Russian False original_language_Serbian False original_language_Slovak False original_language_Spanish False original_language_Spanish (Spain) False original_language_Swahili False original_language_Swedish False original_language_Tagalog False original_language_Tamil False original_language_Telugu False original_language_Thai False original_language_Tola False original_language_Turkish False original_language_Ukrainian False original_language_Unknown True original_language_Vietnamese False original_language_Welsh False original_language_Wolof False original_language_Yiddish False original_language_Yucatec Maya False original_language_Zulu False Name: 10, dtype: object
# Checking rows where the rating is PG (non-baseline)
print(movies_cleaned2[movies_cleaned2['rating'] == 'PG'][['rating', 'original_language']].iloc[0])
print(movies_encoded[movies_cleaned2['rating'] == 'PG'].iloc[0])
rating PG original_language English Name: 16, dtype: object runtime_normalized 0.265574 rating_G False rating_NC-17 False rating_PG True rating_PG-13 False rating_TV False rating_TV-14 False rating_Unknown False original_language_Afrikaans False original_language_Albanian False original_language_Arabic False original_language_Aramaic False original_language_Bangla False original_language_Basque False original_language_Belarusian False original_language_Bosnian False original_language_Catalan False original_language_Chinese False original_language_Czech False original_language_Danish False original_language_Dutch False original_language_Estonian False original_language_European Spanish False original_language_Filipino False original_language_Finnish False original_language_French False original_language_Georgian False original_language_German False original_language_Greek False original_language_Hebrew False original_language_Hindi False original_language_Hungarian False original_language_Icelandic False original_language_Indonesian False original_language_Inuktitut False original_language_Italian False original_language_Japanese False original_language_Kalaallisut False original_language_Kannada False original_language_Khmer False original_language_Korean False original_language_Lao False original_language_Macedonian False original_language_Malay False original_language_Malayalam False original_language_Mongolian False original_language_Nonverbal False original_language_Norwegian False original_language_Persian False original_language_Polish False original_language_Portuguese False original_language_Romanian False original_language_Russian False original_language_Serbian False original_language_Slovak False original_language_Spanish False original_language_Spanish (Spain) False original_language_Swahili False original_language_Swedish False original_language_Tagalog False original_language_Tamil False original_language_Telugu False original_language_Thai False original_language_Tola False original_language_Turkish False original_language_Ukrainian False original_language_Unknown False original_language_Vietnamese False original_language_Welsh False original_language_Wolof False original_language_Yiddish False original_language_Yucatec Maya False original_language_Zulu False Name: 16, dtype: object
# Checking rows where the original language is French (non-baseline)
print(movies_cleaned2[movies_cleaned2['original_language'] == 'French'][['rating', 'original_language']].iloc[0])
print(movies_encoded[movies_cleaned2['original_language'] == 'French'].iloc[0])
rating Unknown original_language French Name: 0, dtype: object runtime_normalized 0.006557 rating_G False rating_NC-17 False rating_PG False rating_PG-13 False rating_TV False rating_TV-14 False rating_Unknown True original_language_Afrikaans False original_language_Albanian False original_language_Arabic False original_language_Aramaic False original_language_Bangla False original_language_Basque False original_language_Belarusian False original_language_Bosnian False original_language_Catalan False original_language_Chinese False original_language_Czech False original_language_Danish False original_language_Dutch False original_language_Estonian False original_language_European Spanish False original_language_Filipino False original_language_Finnish False original_language_French True original_language_Georgian False original_language_German False original_language_Greek False original_language_Hebrew False original_language_Hindi False original_language_Hungarian False original_language_Icelandic False original_language_Indonesian False original_language_Inuktitut False original_language_Italian False original_language_Japanese False original_language_Kalaallisut False original_language_Kannada False original_language_Khmer False original_language_Korean False original_language_Lao False original_language_Macedonian False original_language_Malay False original_language_Malayalam False original_language_Mongolian False original_language_Nonverbal False original_language_Norwegian False original_language_Persian False original_language_Polish False original_language_Portuguese False original_language_Romanian False original_language_Russian False original_language_Serbian False original_language_Slovak False original_language_Spanish False original_language_Spanish (Spain) False original_language_Swahili False original_language_Swedish False original_language_Tagalog False original_language_Tamil False original_language_Telugu False original_language_Thai False original_language_Tola False original_language_Turkish False original_language_Ukrainian False original_language_Unknown False original_language_Vietnamese False original_language_Welsh False original_language_Wolof False original_language_Yiddish False original_language_Yucatec Maya False original_language_Zulu False Name: 0, dtype: object
5.1.5. Building the Linear Regression Model¶
# Add 'runtime_normalized' to the dataset (it's already part of the dataframe, so no need to add it again)
# Step 3: Define X (Predictors) and y (Target)
X = movies_encoded[['runtime_normalized'] + [col for col in movies_encoded.columns if 'rating_' in col or 'original_language_' in col]]
y = movies_cleaned2['audience_score']
# Step 4: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 5: Train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Predict and evaluate
y_pred = model.predict(X_test)
# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean Squared Error:', mse)
print('R-squared:', r2)
# To find out the influence of each variable, check the coefficients
coefficients = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
print(coefficients)
Mean Squared Error: 1.3055604219192443e+25
R-squared: -2.9615455986363807e+22
Coefficient
runtime_normalized 8.695946e+01
rating_G 1.124826e+01
rating_NC-17 -2.009639e+00
rating_PG 5.478009e+00
rating_PG-13 5.805534e-01
rating_TV 3.497762e+00
rating_TV-14 -6.212397e+00
rating_Unknown 5.099218e+00
original_language_Afrikaans -1.633574e+14
original_language_Albanian 2.287989e+01
original_language_Arabic 1.013481e+01
original_language_Aramaic 1.310595e+01
original_language_Bangla 2.097344e+01
original_language_Basque 6.964534e+00
original_language_Belarusian 1.948525e+01
original_language_Bosnian 2.480781e+01
original_language_Catalan 1.996506e+01
original_language_Chinese 8.846052e+00
original_language_Czech 1.843863e-01
original_language_Danish 1.348379e+01
original_language_Dutch 3.694966e+00
original_language_Estonian 1.100269e+01
original_language_European Spanish 6.962192e+00
original_language_Filipino 1.095920e+00
original_language_Finnish -8.642319e+00
original_language_French 9.942527e+00
original_language_Georgian 2.894438e+01
original_language_German 7.630725e+00
original_language_Greek -2.047868e+01
original_language_Hebrew 3.536745e+01
original_language_Hindi -2.601992e+00
original_language_Hungarian 7.587334e+00
original_language_Icelandic 6.532974e+00
original_language_Indonesian 1.079004e+01
original_language_Inuktitut 3.434644e+01
original_language_Italian 5.421875e+00
original_language_Japanese 1.309375e+01
original_language_Kalaallisut 2.310205e+01
original_language_Kannada 8.007812e+00
original_language_Khmer 6.136719e+00
original_language_Korean 4.638672e+00
original_language_Lao 1.983496e+01
original_language_Macedonian 5.213196e+00
original_language_Malay 1.228027e+00
original_language_Malayalam 1.175977e+01
original_language_Mongolian 4.519531e+00
original_language_Nonverbal 2.533496e+01
original_language_Norwegian -1.179688e+00
original_language_Persian 1.678076e+01
original_language_Polish -8.464844e+00
original_language_Portuguese 1.290625e+01
original_language_Romanian -8.732422e+00
original_language_Russian 6.910156e+00
original_language_Serbian -1.565430e+01
original_language_Slovak 1.067920e+01
original_language_Spanish 4.804688e+00
original_language_Spanish (Spain) 3.030273e+00
original_language_Swahili 2.432959e+01
original_language_Swedish 5.275391e+00
original_language_Tagalog -1.084424e+01
original_language_Tamil 5.796875e+00
original_language_Telugu 6.789062e+00
original_language_Thai 4.835938e+00
original_language_Tola -8.144531e-01
original_language_Turkish -7.314453e+00
original_language_Ukrainian 8.350830e+00
original_language_Unknown 6.718750e+00
original_language_Vietnamese -5.261719e+00
original_language_Welsh -1.561572e+01
original_language_Wolof 2.471558e+00
original_language_Yiddish 5.134766e+00
original_language_Yucatec Maya 0.000000e+00
original_language_Zulu -1.091782e+01
# Calculate the difference between actual and predicted values
differences = np.abs(y_test - y_pred)
# Create a scatter plot with a color gradient based on the differences
plt.scatter(y_test, y_pred, c=differences, cmap='coolwarm', edgecolor='k', alpha=0.7)
# Add a color bar to show the scale of differences
plt.colorbar(label='Difference between Actual and Predicted')
# Add labels and title
plt.xlabel('Actual Scores')
plt.ylabel('Predicted Scores')
plt.title('Actual vs Predicted Scores (Colored by Differences)')
plt.show()
5.1.6. Results¶
A break down of the results of our linear regression model:
- Mean Squared Error (MSE): MSE = 1,17 This measures the average squared difference between the actual audience_score values and the predicted values from your model. A lower MSE indicates better performance. In our case, 366.09 indicates that the model’s predictions differ from the actual audience_score by a decent amount on average. This suggests our model is not highly accurate, even though some error is expected.
- R-squared (R²): R² = -2.7 The R² explains how well our model accounts for variance in the target variable (audience_score). This R² is quite low, indicating that the factors we included in the model (runtime, rating, original language) are not capturing most of the variation in audience scores. There might be other important features missing or the relationships between the variables may not be well captured by a linear model.
- Coefficients of the Features: The coefficients indicate the influence of each predictor on the target variable, audience_score. A positive coefficient suggests a positive relationship, while a negative coefficient suggests the opposite.
Next, we will interpret some of the key features:
Runtime (normalized):
Coefficient: 7.90 For each unit increase in the normalized runtime, the audience score is expected to increase by around 7.9 points, assuming all other variables remain constant. Runtime has a strong positive impact on audience score.
Rating:
G: 11.24 Movies rated "G" are expected to have an audience score that is 11.24 points higher than movies rated "R" (your baseline category), all else equal. NC-17: -7.11 Movies rated "NC-17" tend to have audience scores that are about 7.11 points lower than "R"-rated movies. PG: 6.87, PG-13: 0.51, TV: 5.71, TV-14: -4.55 Each of these categories' coefficients indicate how much higher or lower they predict audience score compared to the baseline rating (R). Original Language:
Afrikaans: 15.75, Arabic: 11.24, Chinese: 13.40, Danish: 22.23, French: 19.11, Finnish: 32.17, etc. These coefficients tell how the movie's original language influences the audience score relative to the baseline (English). For example: Movies in Danish are expected to have audience scores about 22.23 points higher than English-language movies. Movies in French would score about 19.11 points higher, while Finnish-language movies would score about 32.17 points higher. Negative coefficients (like for Hindi: -3.75, Norwegian: -4.36) suggest those languages might lower the predicted audience score compared to English. Key Insights: Runtime is positively correlated with audience score, meaning longer movies tend to have higher audience scores. Rating has a clear influence on audience score, with "G" and "PG" rated movies generally receiving better scores than "R", and "NC-17" movies being scored lower than "R". Original Language also has an impact, with some languages (like Danish, French, and Finnish) positively influencing the audience score, while others (like Hindi, Norwegian) have a small negative impact. Next Steps to Improve the Model: Include more relevant features (e.g., critic_score, audience_sentiment, critic_sentiment, director, actor). Some of these features may have to come from other datasets found online.
5.2. Natural Language Processing¶
Natural Language Processing (NLP) is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language. Through the implementation of NLP techniques, we can analyze textual data—such as movie reviews—transforming unstructured information into actionable insights. This allows us to assess sentiment, extract themes, and gauge audience reactions, enhancing our overall understanding of viewer perceptions and preferences.
5.2.1. Cleaning Quotes and Filtering Nouns and Verbs¶
The first step in our analysis involved cleaning the movie review quotes. We used the process_quotes() function to remove stop words, which are common but uninformative words such as "the," "is," and "and." Additionally, we defined a list of words to exclude, such as general movie-related terms (e.g., "movie," "film") and quality descriptors (e.g., "good," "bad"). By removing these words, we focused the analysis on those words that provide deeper insight into the content and themes of the reviews.
After cleaning, we filtered each cleaned quote to retain only nouns and verbs. This was done using part-of-speech tagging to identify words representing core ideas or actions. This allowed us to focus specifically on key themes, such as "love," "war," or "acting," and highlight the main concepts expressed in the reviews while eliminating less important parts of speech.
# Download necessary NLTK data
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('stopwords', quiet=True)
# Define stop words and words to exclude
stop_words = set(stopwords.words('english'))
words_to_exclude = {
# General movie-related terms
'movie', 'film', 'cinema', 'picture', 'flick', 'screening', 'showing',
# Quality descriptors
'good', 'bad', 'great', 'terrible', 'excellent', 'poor', 'amazing', 'awful',
'wonderful', 'horrible', 'best', 'worst', 'favorite', 'liked', 'disliked',
# Cinema-related terms
'actor', 'actress', 'director', 'producer', 'screenplay', 'script', 'scene',
'character', 'plot', 'story', 'dialogue', 'cinematography', 'soundtrack',
# General descriptors
'interesting', 'boring', 'exciting', 'dull', 'entertaining', 'disappointing',
'impressive', 'mediocre', 'overrated', 'underrated',
# Time-related terms
'hour', 'minute', 'long', 'short',
# Viewing experience
'watch', 'saw', 'seen', 'theater', 'cinema', 'home',
# Rating-related terms
'star', 'rating', 'review', 'critic', 'audience',
# Production-related terms
'budget', 'box office', 'sequel', 'remake', 'adaptation',
# General opinion words
'think', 'thought', 'feel', 'felt', 'believe', 'opinion'
}
stop_words.update(words_to_exclude)
def process_quote(quote):
# Convert to string if it's a list
if isinstance(quote, list):
quote = ' '.join(quote)
else:
quote = str(quote)
# Tokenize and remove stop words and excluded words
tokens = word_tokenize(quote.lower())
tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
# Perform POS tagging
tagged = pos_tag(tokens)
# Filter for nouns and verbs
filtered = [word for word, pos in tagged if pos.startswith('N') or pos.startswith('V')]
return ' '.join(filtered)
def process_quotes_in_chunks(df, chunk_size=1000):
processed_chunks = []
total_chunks = len(df) // chunk_size + (1 if len(df) % chunk_size != 0 else 0)
for i in tqdm(range(0, len(df), chunk_size), total=total_chunks, desc="Processing chunks"):
try:
chunk = df.iloc[i:i+chunk_size].copy()
chunk['filtered_quotes'] = chunk['all_quotes'].apply(process_quote)
processed_chunks.append(chunk)
print(f"Processed chunk {i//chunk_size + 1}/{total_chunks}")
except Exception as e:
print(f"An error occurred while processing chunk {i//chunk_size + 1}: {str(e)}")
return pd.concat(processed_chunks, ignore_index=True)
5.2.2. Word Cloud of Themes¶
To visualize the most common themes discussed in the movie reviews, we generated a word cloud. The word cloud was created using the cleaned dataset to present an intuitive overview of frequently mentioned topics. By emphasizing the prominent words, it gives a clear picture of what audiences and critics frequently focus on when evaluating movies.
# Load the processed data
# Google Drive file ID for the "processed_movie_quotes.csv" file
file_id = '1jb_T43v3N7bbbosxINaLjp0JTK7HuKeV'
# Import the "processed_movie_quotes.csv" file using gdown
gdown.download(f'https://drive.google.com/uc?export=download&id={file_id}', 'processed_movie_quotes.csv', quiet=True)
# Load the downloaded "processed_movie_quotes.csv" file into a pandas DataFrame
df = pd.read_csv('processed_movie_quotes.csv')
# Replace NaN with an empty string and convert all values to strings
df['filtered_quotes'] = df['filtered_quotes'].fillna('').astype(str)
# Remove any empty strings
df = df[df['filtered_quotes'] != '']
print("Number of rows after cleaning:", len(df))
Number of rows after cleaning: 10191
from collections import Counter
import time
import gc
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Function to process quotes in chunks
def process_quotes_chunk(chunk):
return Counter(' '.join(chunk).split())
# Load and process data in chunks
chunk_size = 1000
word_freq = Counter()
print("Processing quotes in chunks...")
start_time = time.time()
for i in range(0, len(df), chunk_size):
# Use .copy() to avoid the warning
chunk = df.iloc[i:i + chunk_size].copy()
chunk['filtered_quotes'] = chunk['filtered_quotes'].fillna('').astype(str)
chunk = chunk[chunk['filtered_quotes'] != '']
word_freq.update(process_quotes_chunk(chunk['filtered_quotes']))
# Clear memory
del chunk
gc.collect()
end_time = time.time()
print(f"Time taken: {end_time - start_time:.2f} seconds")
# Keep only frequent words
min_freq = 10
frequent_words = {word: count for word, count in word_freq.items() if count > min_freq}
print(f"Total unique words: {len(word_freq)}")
print(f"Words appearing more than {min_freq} times: {len(frequent_words)}")
# Create word cloud
wordcloud = WordCloud(width=1600, height=800,
background_color='white',
max_words=500,
relative_scaling=0.5).generate_from_frequencies(frequent_words)
plt.figure(figsize=(20, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Movie Quotes (Full Dataset)')
plt.show()
Processing quotes in chunks... Time taken: 20.25 seconds Total unique words: 883669 Words appearing more than 10 times: 121799
# Separate good and bad movies
median_score = df['average_combined_score'].median()
good_movies = df[df['average_combined_score'] > median_score]
bad_movies = df[df['average_combined_score'] <= median_score]
# Create bigram vectorizer
vectorizer = CountVectorizer(ngram_range=(2,2), stop_words='english')
# Get bigrams for good and bad movies
good_bigrams = vectorizer.fit_transform(good_movies['filtered_quotes'])
bad_bigrams = vectorizer.transform(bad_movies['filtered_quotes'])
# Calculate ratio of frequencies
good_freq = good_bigrams.sum(axis=0).A1
bad_freq = bad_bigrams.sum(axis=0).A1
ratio = np.log((good_freq + 1) / (bad_freq + 1))
# Get feature names
feature_names = vectorizer.get_feature_names_out()
# Sort by ratio and get top distinctive bigrams
top_good = sorted(zip(feature_names, ratio), key=lambda x: -x[1])[:20]
top_bad = sorted(zip(feature_names, ratio), key=lambda x: x[1])[:20]
print("Bigrams more associated with good movies:")
print(', '.join([f'"{bg}"' for bg, _ in top_good]))
print("\nBigrams more associated with bad movies:")
print(', '.join([f'"{bg}"' for bg, _ in top_bad]))
Bigrams more associated with good movies: "love hepburn", "crying baseball", "john cazale", "spider pig", "montoya killed", "dodge wrench", "dench cate", "bobsled team", "wrench dodge", "pig spider", "ford president", "downey kilmer", "toga toga", "battle iwo", "grant hepburn", "steve wiebe", "sperm sacred", "hanks denzel", "sandler cheadle", "joy division" Bigrams more associated with bad movies: "quack quack", "grant drew", "danielle harris", "li jason", "power glove", "chris kattan", "price bitch", "carrey riddler", "love duff", "dax shepard", "tommy jarvis", "walker jessica", "hilton died", "sandler nicholson", "carrey lee", "jason space", "jackman rachel", "hilton dies", "bob barker", "motherfucking snakes"
# Set this to True to create word clouds
CREATE_WORD_CLOUD = True
CUSTOM_STOPWORDS = set([
# General filler words that do not add distinctiveness
'get', 'got', 'make', 'made', 'see', 'go', 'going', 'take', 'want', 'know',
'look', 'find', 'remember', 'came', 'come', 'comes', 'getting', 'done',
'trying', 'need', 'kept', 'work', 'used', 'went', 'felt',
'little', 'big', 'best', 'good', 'fun', 'lot', 'lots', 'thing', 'things',
'everything', 'anything', 'nothing', 'play', 'played', 'plays', 'playing',
'part', 'parts', 'left', 'end', 'ending', 'line', 'based', 'version',
'sort', 'kind', 'seem', 'seems', 'feels', 'felt', 'place', 'places',
'wait', 'waiting', 'given', 'giving', 'took', 'turn', 'turns',
'moment', 'moments', 'viewing', 'john', 'stars', 'effects', 'keep',
'lines', 'story', 'plot', 'told', 'tell', 'telling', 'seen',
'sense', 'coming', 'couple', 'shows', 'piece', 'moving', 'mind', 'style',
'hate', 'lol', 'looking', 'hell', 'kid', 'night', 'game', 'jokes', 'kinda', 'wanted',
'stuff', 'let', 'girl', 'looks', 'b', 'hate', 'kid', 'night', 'movies', 'thriller', 'humor',
'lol', 'expect', 'tv', 'recommend', 'fans', 'someone', 'care', 'school', 'ass', 'death', 'start',
'michael', 'twist', 'seemed', 'rest', 'money', 'house', 'beginning', 'la',
'works', 'try', 'lost', 'others', 'genre', 'concept', 'guess', 'premise', 'expecting', 'creepy',
'jason', 'act', 'sex', 'animation', 'score', 'experience', 'written', 'hollywood', 'directed',
'brilliant', 'camera', 'violence', 'fight', 'hope', 'gave', 'read', 'understand', 'god', 'face', 'children',
# General and evaluative terms without specific value
'movies', 'time', 'way', 'characters', 'acting', 'scenes', 'cast', 'times',
'something', 'watched', 'actors', 'role', 'makes', 'say', 'book', 'series', 'films',
'performance', 'performances', 'job', 'music', 'fun', 'enjoyed', 'history', 'war',
'action', 'drama', 'gets', 'bit', 'enjoy', 'half', 'reason', 'fact', 'idea', 'script',
'scene', 'scenes', 'storyline', 'director', 'help', 'anyone', 'gives', 'course',
# Evaluative words without context
'oscar', 'masterpiece', 'genius', 'epic', 'superb', 'perfect', 'stunning',
'crap', 'shit', 'waste', 'beautiful', 'annoying', 'cheesy', 'great', 'funny', 'awesome',
'fantastic', 'bad', 'terrible', 'awful', 'wonderful', 'amazing', 'interesting',
# Overly broad descriptors
'man', 'woman', 'people', 'guys', 'girls', 'family', 'kids', 'ladies', 'men', 'friends',
'today', 'life', 'years', 'days', 'day', 'minutes', 'hours', 'age',
'cool', 'laugh', 'laughs', 'laughing', 'smile', 'cute', 'nice',
# Words without distinctive value in quality differentiation
'love', 'loved', 'comedy', 'horror', 'guy', 'everyone', 'show', 'fan', 'seeing',
'making', 'point', 'found', 'expected', 'said', 'give', 'put', 'takes',
'set', 'goes', 'year', 'heart', 'watching', 'world', 'direction', 'que', 'disney',
# Additional words to add to CUSTOM_STOPWORDS:
'franchise', 'james', 'tom', 'smith',
'problem', 'thinking', 'sequences', 'dont', 'better', 'jim',
'rock', 'kill', 'car', 'head', 'happened', 'use', 'person', 'ones',
'critics', 'save', 'keeps', 'stop', 'tries', 'message', 'points',
'supposed', 'stories', 'entertainment', 'starts', 'looked', 'started',
'okay', 'son', 'turned', 'kevin', 'landmark', 'oscars', 'division',
'firefly', 'pianist', 'mean', 'sucked', 'needed', 'worth', 'fucking',
'hated', 'disturbing', 'ok', 'sucks', 'laughed', 'happens', 'saying',
'needs', 'wants', 'taking', 'leave', 'cry', 'called', 'becomes', 'happen',
'run', 'gon', 'says', 'falls', 'become', 'taking', 'case', 'side', 'ways',
'gene', 'hitler', 'mozart', 'gandhi', 'brandon', 'howl', 'dean',
'grail', 'marx', 'salieri', 'kinski', 'hawks', 'lamotta', 'atticus',
'timeless', 'sunset', 'sunrise', 'boxing', 'ewoks', 'bjork',
'plainview', 'gael', 'spaghetti', 'attention', 'feeling', 'roles', 'type',
'name', 'thats', 'eyes',
'ends', 'hit', 'screen', 'move', 'cut', 'voice', 'ended', 'team',
'e', 'interest', 'cause', 'dance', 'taken', 'es', 'friend',
'elements', 'check', 'center', 'yes', 'knew', 'gangster', 'western', 'python', 'noir',
'mob', 'guinness', 'raging',
'criterion', 'vietnam', 'falcon', 'fargo', 'murrow', 'maude', 'harvey',
'spirited', 'strangelove', 'cagney', 'bueller', 'travis', 'bueller',
'rwanda', 'mcmurphy', 'falcon', 'landau', 'pekars', 'altman', 'stanley', 'criterion',
'criterion', 'criterion', 'fargo',
'musicals', 'technicolor', 'indemnity', 'masterpieces', 'barton',
'searchers', 'louise', 'rains', 'vienna', 'serenity', 'vito', 'pesci',
'braddock', 'kane', 'katharine', 'cleef', 'benigni', 'vivien',
'andrews', 'sergio', 'benigni', 'andrews', 'toshio',
'writing', 'die', 'stand', 'bring', 'alot', 'visuals', 'killed', 'wife',
'yeah', 'delivers', 'harry', 'lack', 'development', 'child', 'mark',
'change', 'video', 'chris', 'brings',
'img', 'decent', 'quality', 'dvd', 'surprise', 'scares', 'da', 'ideas',
'matter', 'damn', 'fell', 'villain', 'production', 'lead', 'casting',
'leads', 'expectations', 'relationship', 'ride', 'knows', 'effort',
'songs', 'parents', 'means', 'admit', 'fighting', 'dog', 'jackson',
'bill', 'please', 'attempt', 'lacks', 'dark', 'events', 'group', 'pretty',
'wonder', 'mess', 'level', 'pero', 'fails', 'song', 'wests', 'mcmurpow',
'chick', 'women', 'adam', 'boy', 'king', 'didnt',
])
# Set this to True to create word clouds
CREATE_WORD_CLOUD = True
# Combine with standard stopwords
ALL_STOPWORDS = set(STOPWORDS).union(CUSTOM_STOPWORDS)
OPWORDS = set(STOPWORDS).union(CUSTOM_STOPWORDS)
# Tokenizer to use for both positive and negative reviews
tokenizer = re.compile(r'\b\w+\b')
def process_data_in_chunks(file_path, chunk_size=1000):
positive_counter = Counter()
negative_counter = Counter()
median_score = None
total_rows = 0
print("Calculating median score...")
start_time = time.time()
for chunk in pd.read_csv(file_path, chunksize=chunk_size, encoding='utf-8'):
chunk = chunk[chunk['filtered_quotes'].notna() & (chunk['filtered_quotes'] != '')]
total_rows += len(chunk)
if median_score is None:
median_score = chunk['average_combined_score'].median()
print(f"Median score: {median_score}")
print(f"Total rows: {total_rows}")
print(f"Time taken to calculate median: {time.time() - start_time:.2f} seconds")
print("Processing quotes...")
start_time = time.time()
with tqdm(total=total_rows, unit='row') as pbar:
for chunk in pd.read_csv(file_path, chunksize=chunk_size, encoding='utf-8'):
chunk = chunk[chunk['filtered_quotes'].notna() & (chunk['filtered_quotes'] != '')]
positive_chunk = chunk[chunk['average_combined_score'] > median_score]['filtered_quotes']
negative_chunk = chunk[chunk['average_combined_score'] <= median_score]['filtered_quotes']
# Use the same tokenizer for both positive and negative quotes
positive_counter.update(word.lower() for quote in positive_chunk for word in tokenizer.findall(quote) if word.lower() not in ALL_STOPWORDS)
negative_counter.update(word.lower() for quote in negative_chunk for word in tokenizer.findall(quote) if word.lower() not in ALL_STOPWORDS)
pbar.update(len(chunk))
del chunk, positive_chunk, negative_chunk
gc.collect()
print(f"Time taken to process quotes: {time.time() - start_time:.2f} seconds")
return positive_counter, negative_counter
def compare_word_frequencies(counter1, counter2):
word_diff = Counter()
for word in set(counter1.keys()).union(counter2.keys()):
word_diff[word] = counter1.get(word, 0) - counter2.get(word, 0)
return word_diff
def create_word_cloud(counter, title):
if not CREATE_WORD_CLOUD:
print("Word cloud creation is disabled.")
return
try:
print(f"Generating word cloud for {title}...")
# Adjust the parameters to ensure all words are visible
wordcloud = WordCloud(
width=800,
height=400,
background_color='white',
max_words=200, # Increase to include more words
relative_scaling=0.5, # Adjust scaling to give less frequent words a better chance
min_font_size=5, # Ensure smaller words are still displayed
stopwords=set() # Avoid automatic filtering
).generate_from_frequencies(counter)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title(title)
plt.show()
plt.close()
print(f"Word cloud for {title} has been displayed.")
except Exception as e:
print(f"Failed to generate word cloud for {title}. Error: {e}")
def create_filtered_counter(word_diff, condition='positive', top_n=100):
filtered_counter = Counter()
if condition == 'positive':
filtered_words = {word: diff for word, diff in word_diff.items() if diff > 0}
elif condition == 'negative':
filtered_words = {word: -diff for word, diff in word_diff.items() if diff < 0}
else:
raise ValueError("Condition must be either 'positive' or 'negative'")
# Sort and get the top_n words
top_words = dict(sorted(filtered_words.items(), key=lambda x: -x[1])[:top_n])
filtered_counter.update(top_words)
return filtered_counter
def print_distinctive_words(counter, title):
distinctive_words = list(counter.keys())
print(f"\nDistinctive words for {title}:")
print(", ".join(distinctive_words))
# Main execution (updated for clarification)
start_time = time.time()
positive_counter, negative_counter = process_data_in_chunks('processed_movie_quotes.csv')
print("Comparison with stopwords:")
word_diff = compare_word_frequencies(positive_counter, negative_counter)
# Create filtered counters for distinctive words
distinctive_positive_counter = create_filtered_counter(word_diff, condition='positive', top_n=50)
distinctive_negative_counter = create_filtered_counter(word_diff, condition='negative', top_n=50)
# Print the distinctive words for your reference
print_distinctive_words(distinctive_positive_counter, "Highly-Rated Movies")
print_distinctive_words(distinctive_negative_counter, "Poorly-Rated Movies")
# Generate word clouds for distinctive positive and negative words
if CREATE_WORD_CLOUD:
create_word_cloud(distinctive_positive_counter, 'Word Cloud of Distinctive Words in Highly-Rated Movies')
create_word_cloud(distinctive_negative_counter, 'Word Cloud of Distinctive Words in Poorly-Rated Movies')
# Clear memory
del positive_counter, negative_counter
gc.collect()
print(f"Total execution time: {time.time() - start_time:.2f} seconds")
print("Script completed. Check the console output for results and any generated word clouds.")
Calculating median score... Median score: 8.452627739427342 Total rows: 10191 Time taken to calculate median: 6.45 seconds Processing quotes...
100%|██████████| 10191/10191 [00:43<00:00, 232.32row/s]
Time taken to process quotes: 43.87 seconds Comparison with stopwords: Distinctive words for Highly-Rated Movies: miyazaki, hepburn, pacino, hitchcock, kubrick, kurosawa, leone, bogart, chaplin, brando, goodfellas, godfather, wilder, stewart, welles, bergman, anime, coen, almodovar, newman, sellers, hayao, westerns, audrey, lemmon, lynch, departed, corleone, marlon, fellini, wayne, akira, peck, schindler, mifune, polanski, scarface, totoro, casablanca, mononoke, godard, samurai, capra, morricone, pixar, orson, bacall, ganz, ennio, bates Distinctive words for Poorly-Rated Movies: bond, sandler, cage, cruise, twists, gore, blood, chemistry, zombie, books, williams, killer, romance, cgi, reviews, jack, vampire, carrey, jennifer, ben, adventure, murphy, surprised, sequels, suspense, confusing, robin, hero, comedies, jones, con, childhood, batman, manages, christmas, chance, talking, power, acted, material, le, supporting, hand, city, scared, worked, gone, wow, running, missing Generating word cloud for Word Cloud of Distinctive Words in Highly-Rated Movies...
Word cloud for Word Cloud of Distinctive Words in Highly-Rated Movies has been displayed. Generating word cloud for Word Cloud of Distinctive Words in Poorly-Rated Movies...
Word cloud for Word Cloud of Distinctive Words in Poorly-Rated Movies has been displayed. Total execution time: 51.97 seconds Script completed. Check the console output for results and any generated word clouds.
# Top 20 word lists
highly_rated_top_20 = [
'miyazaki', 'hepburn', 'pacino', 'hitchcock', 'kubrick', 'kurosawa',
'leone', 'bogart', 'chaplin', 'brando', 'goodfellas', 'godfather',
'wilder', 'stewart', 'welles', 'bergman', 'anime', 'coen',
'almodovar', 'newman'
]
poorly_rated_top_20 = [
'bond', 'sandler', 'cage', 'cruise', 'twists', 'gore', 'blood',
'chemistry', 'zombie', 'williams', 'killer', 'romance', 'cgi',
'vampire', 'carrey', 'adventure', 'murphy', 'surprised',
'sequels', 'suspense', 'confusing'
]
# Create Counter objects for the word cloud
highly_rated_counter = Counter({word: 1 for word in highly_rated_top_20})
poorly_rated_counter = Counter({word: 1 for word in poorly_rated_top_20})
# Generate word clouds
create_word_cloud(highly_rated_counter, 'Word Cloud of Distinctive Words in Highly-Rated Movies')
Generating word cloud for Word Cloud of Distinctive Words in Highly-Rated Movies...
Word cloud for Word Cloud of Distinctive Words in Highly-Rated Movies has been displayed.
create_word_cloud(poorly_rated_counter, 'Word Cloud of Distinctive Words in Poorly-Rated Movies')
Generating word cloud for Word Cloud of Distinctive Words in Poorly-Rated Movies...
Word cloud for Word Cloud of Distinctive Words in Poorly-Rated Movies has been displayed.
5.2.3. NLP Results¶
The most distinctive words in highly-rated movies included notable directors like "Miyazaki" and "Hitchcock" as well as iconic actors like "Pacino" and "Hepburn." These names suggest that well-known talents significantly contribute to a movie's success. On the other hand, poorly-rated movies had themes like "gore," "zombie," and "sequels," indicating that certain genres or types of movies tend to receive more critical reviews.
Insights:
Highly-Rated Movies: The distinctive words associated with highly-rated movies included notable names such as "Miyazaki," "Hitchcock," "Kubrick," and "Bergman." This suggests that acclaimed directors and actors are frequently linked to movies that perform well. Themes like "anime" and "coen" indicate that certain genres or production styles resonate positively with audiences.
Poorly-Rated Movies: The most distinctive words for poorly-rated movies included "sandler," "cage," "gore," and "zombie." These words suggest that movies featuring certain actors or focusing on specific genres, such as horror, are more likely to receive critical reviews. Words like "sequels" and "twists" also imply that poorly-executed plot twists or over-reliance on sequels may contribute to lower ratings.
These insights help us understand the elements that audiences and critics find appealing or off-putting, providing valuable guidance for movie creators looking to enhance their content and reach wider acclaim.
6. Conclusion¶
This project has been an insightful exploration into the world of data, with significant tasks undertaken and key topics examined in depth. The work demonstrates advanced techniques in data cleaning, manipulation, and visualization. Functions were employed to optimize the code, reducing redundancy and improving overall readability. The data visualization component provided clear and immediate insights into the most popular movies in the industry, utilizing interactive graphs to enhance the presentation and efficiently manage space.
The "Data Science" section highlighted the power of analytical methods in uncovering patterns that are applicable to real-world scenarios.
Furthermore, merging multiple datasets into a cohesive whole was a key element of the project, showcasing an important skill in data analysis. This merged dataset was effectively used in the NLP section to analyze both critic and user reviews, delivering valuable insights into sentiment and opinion within the movie industry.